Abstract: Birnbaum's theorem, that the sufficiency and conditionality principles
entail the likelihood principle, has engendered a great deal of controversy
and discussion since the publication of the result in 1962. In particular, many 
have raised doubts as to the validity of this result. Typically these doubts are 
concerned with the validity of the principles of sufficiency and conditionality
as expressed by Birnbaum. Technically it would seem, however, that the proof 
itself is sound. In this paper we use set theory to formalize the context in which 
the result is proved and show that in fact Birnbaum's theorem is incorrectly 
stated as a key hypothesis is left out of the statement. When this hypothesis is 
added, we see that sufficiency is irrelevant, and that the result is dependent on 
a well-known flaw in conditionality that renders the result almost vacuous.

1 Introduction

A result presented in Birnbaum (1962), and referred to as Birnbaum's theorem, 
is very well-known in statistics. This result says that a statistician who accepts 
both the sufficiency S and conditionality C principles must also accept the
likelihood principle L and conversely. The result has always been controversial
primarily because it implies that a frequentist statistician who accepts S and C is
forced to ignore the repeated sampling properties of any inferential procedures 
they use. Given that both S and C seem quite natural to many frequentist 
statisticians while L does not, the result is highly paradoxical. 

Various concerns have been raised about the proof of the result. For example, 
Durbin (1970) argued that the theorem fails to hold whenever C is restricted 
by requiring that any ancillaries used must be functions of a minimal sufficient 
statistic. Kalbfleisch (1975) argued that C should only be applicable when 
the value of the ancillary statistic used to condition is actually a part of the 
experimental make-up. This is called the weak conditionality principle. In
Evans, Fraser and Monette (1986) it is argued that Birnbaum's theorem, and
a similar result that accepting C alone is equivalent to accepting L, are invalid
because the specific uses of S and C in proving these results can be seen to be 
based on flaws in their formulations. For example, Birnbaum's theorem requires 
a use of S and C where the information discarded by S as irrelevant, which is 
the primary motivation for S, is exactly the information used by C to condition 
on and so identifies the discarded information as highly relevant. As such S and 
C contradict each other. We note that this is precisely what Durbin's restriction 
on the ancillaries avoids. Furthermore, the result that C alone implies L can be 
seen to depend on the lack of a unique maximal ancillary which can be viewed 
as an essential flaw in C. Also, see Holm (1985), Barndorff-Nielsen (1995) 
and Helland (1995) for various concerns about the formulation of the theorem. 
Mayo (2010) argues that, in the context of a repeated sampling formulation for 
statistics, we cannot simultaneously have S and C true, as when S is true then 
C is false and when C is true then S is false. Gandenberger (2012) offers up a 
proof that avoids some of the objections raised by others. 

Many of these reservations are essentially with the hypotheses to the theorem 
and suggest that Birnbaum's theorem should be rejected because the hypotheses 
are either not acceptable or have been misapplied. It is the purpose of this pa- 
per to provide a careful set-theoretic formulation of the context of the theorem. 
When this is done we see that there is a hypothesis that needs to be formally 
acknowledged as part of the statement of Birnbaum's theorem. With this ad- 
dition, the force of the result is lost and the paradox disappears. The same 
conclusions apply to the result that C is equivalent to L and, in fact, this is really
the only result as S is redundant in Birnbaum's theorem when the additional 
hypothesis is formally acknowledged. 

For our discussion it is important that we stick as closely as possible to 
Birnbaum's formulation. To discuss the proof, however, we have to make cer- 
tain aspects of Birnbaum's argument mathematically precise that are somewhat 
vague in his paper. It is always possible then that someone will argue that we 
have done this in a way that is not true to Birnbaum's intention. We note, 
however, that this is accomplished in a very simple and direct way. If there is 
another precise formulation that makes the theorem true, then it is necessary 
for a critic of how we do this to provide that alternative. 

A basic step missing in Birnbaum (1962) was to formulate the principles
as relations on the set $\mathcal{I}$ of all model and data combinations. So $\mathcal{I}$ is the set
of all inference bases $I = (E, x)$ where $E = (\mathcal{X}_E, \{f_{E,\theta} : \theta \in \Theta_E\})$, $\mathcal{X}_E$ is a
sample space, $\{f_{E,\theta} : \theta \in \Theta_E\}$ is a collection of probability density functions
on $\mathcal{X}_E$, with respect to some support measure $\mu_E$ on $\mathcal{X}_E$, and $x \in \mathcal{X}_E$ is the
observed data. We will ignore all measure-theoretic considerations as they are
not essential for any of the arguments. If the reader is concerned by this, then
we note that the collection of models where $\mathcal{X}_E$ and $\Theta_E$ are finite and $\mu_E$ is
counting measure is rich enough to produce the paradoxical result. So in general
we can consider our discussion restricted to the case where $\mathcal{X}_E$ and $\Theta_E$ are finite.
It is our view that infinite sets and continuous probability measures are not 
necessary for the development of the basic principles of statistics. Rather the 
use of infinite sets and continuity represents approximations to a finite reality
and appropriate restrictions must be employed on such quantities so that we are
not misled by purely mathematical considerations. In spite of our restrictions,
most of our development applies equally well under very general circumstances. 

We note that expressing the principles as relations was part of Evans, Fraser
and Monette (1986); this is taken further here. In Section 2 we discuss the
meaning and use of relations generally. In Section 3 we apply our discussion of 
relations to Birnbaum's theorem. In Section 4 we draw some conclusions. 

2 Relations 

A relation R with domain D is a subset $R \subset D \times D$. Saying $(x, y) \in R$ means
that the objects x and y have a property in common. For example, suppose D
is the set of students enrolled at a specific university at a specific point in time.
Let $R_1$ be defined by $(x, y) \in R_1$ whenever x and y are students in the same
class. Let $R_2$ be defined by $(x, y) \in R_2$ whenever x and y have taken a course
from the same professor.

A relation R is reflexive if $(x, x) \in R$ for all $x \in D$, symmetric if $(x, y) \in R$
implies $(y, x) \in R$, and transitive if $(x, y) \in R$, $(y, z) \in R$ implies that
$(x, z) \in R$. If a relation R is reflexive, symmetric and transitive, then R is called an
equivalence relation. Clearly $R_1$ is an equivalence relation and, while $R_2$ is
reflexive and symmetric, it is not typically transitive and so is not an equivalence
relation. While $(x, y) \in R$ implies that x and y are related, perhaps by the
possession of some property, when R is an equivalence relation this implies that
x and y possess the property to the same degree. We say that relation R on D
implies relation R' on D whenever $R \subset R'$. Clearly we have that $R_1 \subset R_2$.
If R is a relation on D, then the equivalence relation $\bar{R}$ generated by R is
the smallest equivalence relation containing R. We see that $\bar{R}$ is the intersection
of all equivalence relations on D containing R. Also we have that

$$\bar{R} = \{(x, y) : \exists\, n, x_1, \ldots, x_n \in D \text{ with } x = x_1,\ y = x_n \text{ and } (x_i, x_{i+1}) \in R \text{ or } (x_{i+1}, x_i) \in R\}. \tag{1}$$

It is not always clear that $\bar{R}$ has a meaningful interpretation, at least as
it relates to the property being expressed by R. For example, $\bar{R}_2$ is somewhat
more difficult to interpret and surely goes beyond the idea that $R_2$ is perhaps
trying to express, namely, that two students were directly influenced by the
same professor. In fact, it is entirely possible that $\bar{R}_2 = D \times D$. As another
example, suppose that $D = \{2, 3, 4, \ldots\}$ and $(x, y) \in R$ when x and y have
a common factor bigger than 1. Then R is reflexive and symmetric but not
transitive. If $x, y \in D$ then $(x, xy) \in R$, $(xy, y) \in R$ so $\bar{R} = D \times D$ and $\bar{R}$ is
saying nothing. It seems that each situation, where we extend a relation R to
an equivalence relation, must be examined to see whether or not this extension
has any meaningful content for the application.
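
The common-factor example can be tried out on a small finite domain. The following is a minimal sketch, not part of the original development: the function names and the five-element sub-domain are assumptions made for illustration, and the sub-domain is chosen to contain the product 6 = 2 * 3 so that the chain used in the text stays inside it. The generated relation is computed with a union-find structure, which yields the same pairs as formula (1).

```python
from itertools import product


def generated_equivalence(domain, relation):
    """Smallest equivalence relation on `domain` containing `relation`.

    Two elements end up related exactly when a finite chain joins them,
    each link being a pair of `relation` taken in either direction.
    """
    parent = {x: x for x in domain}

    def find(x):
        # Follow parent links to the representative of x's class.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in relation:
        parent[find(x)] = find(y)  # merge the classes of x and y

    return {(x, y) for x, y in product(domain, repeat=2) if find(x) == find(y)}


def is_transitive(relation):
    return all((x, w) in relation
               for x, y in relation for z, w in relation if y == z)


def has_common_factor(x, y):
    return any(x % d == 0 and y % d == 0 for d in range(2, min(x, y) + 1))


# Hypothetical finite sub-domain of {2, 3, 4, ...}.
D = [2, 3, 4, 6, 9]
R = {(x, y) for x, y in product(D, repeat=2) if has_common_factor(x, y)}
R_bar = generated_equivalence(D, R)

print(is_transitive(R))                    # False: (2, 6), (6, 3) in R, (2, 3) not
print(R_bar == set(product(D, repeat=2)))  # True: every pair becomes related
```

On this sub-domain the relation R separates 2 from 3, while the generated relation relates every pair and so no longer says anything about common factors.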

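Now suppose we have relations $R_1$ and $R_2$ on D and consider the relation
$R_1 \cup R_2$. The following result is relevant to our discussion in Section 3.
Lemma 1. $\overline{R_1 \cup R_2} = \overline{\bar{R}_1 \cup \bar{R}_2}$.

Proof: We have that $R_1 \cup R_2 \subset \bar{R}_1 \cup \bar{R}_2$ so $\overline{R_1 \cup R_2} \subset \overline{\bar{R}_1 \cup \bar{R}_2}$ while
$\bar{R}_1 \subset \overline{R_1 \cup R_2}$, $\bar{R}_2 \subset \overline{R_1 \cup R_2}$ implies $\overline{\bar{R}_1 \cup \bar{R}_2} \subset \overline{R_1 \cup R_2}$.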

This says that the equivalence relation generated by the union of relations is 
equal to the equivalence relation generated by the union of the correspond- 
ing generated equivalence relations. Furthermore, it is clear that the union of 
equivalence relations is not in general an equivalence relation. 
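
Both remarks can be checked on a toy domain. This is a sketch under assumed names: the four-element domain and the relations `R1` and `R2` are hypothetical, and `closure` computes the generated equivalence relation by repeatedly adding the pairs forced by transitivity.

```python
def closure(domain, relation):
    """Equivalence relation generated by `relation` on `domain`."""
    rel = set(relation) | {(x, x) for x in domain} | {(y, x) for x, y in relation}
    changed = True
    while changed:
        # Pairs forced by transitivity that are not yet present.
        new_pairs = {(x, w) for x, y in rel for z, w in rel if y == z} - rel
        changed = bool(new_pairs)
        rel |= new_pairs
    return rel


def is_equivalence(domain, rel):
    reflexive = all((x, x) in rel for x in domain)
    symmetric = all((y, x) in rel for x, y in rel)
    transitive = all((x, w) in rel for x, y in rel for z, w in rel if y == z)
    return reflexive and symmetric and transitive


# Hypothetical domain and relations chosen for illustration.
D = [1, 2, 3, 4]
R1 = {(1, 2)}
R2 = {(2, 3)}

union_of_closures = closure(D, R1) | closure(D, R2)
lhs = closure(D, R1 | R2)
rhs = closure(D, union_of_closures)

print(lhs == rhs)                            # True, as Lemma 1 asserts
print(is_equivalence(D, union_of_closures))  # False: (1, 2), (2, 3) but not (1, 3)
```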

3 Statistical Relations and Principles 

We define a statistical relation to be a relation on $\mathcal{I}$ and a statistical principle to
be an equivalence relation on $\mathcal{I}$. The idea behind a statistical principle, as used
here, is that equivalent inference bases contain the same amount of statistical
information about the unknown $\theta$. We make no attempt to give a precise definition
of what statistical information means. Birnbaum (1962) identified two inference
bases $I_1, I_2 \in \mathcal{I}$ as containing the same amount of statistical information via
the notation $Ev(I_1) = Ev(I_2)$. We consider several statistical relations.

The likelihood relation L on $\mathcal{I}$ is defined by $(I_1, I_2) \in L$ whenever $\Theta_{E_1} = \Theta_{E_2}$
and there exists c > 0 such that $f_{E_1,\theta}(x_1) = c f_{E_2,\theta}(x_2)$ for every $\theta$. We have the
following obvious result.

Lemma 2. L is a statistical principle.
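
For finite parameter spaces, membership in L can be checked mechanically. The sketch below is illustrative only: the function `likelihood_related` and the two small models `f1` and `f2` are assumptions, with each model stored as a dictionary from parameter values to densities.

```python
from fractions import Fraction as F


def likelihood_related(f1, x1, f2, x2):
    """Return the constant c when ((E1, x1), (E2, x2)) is in L, else None.

    The relation requires equal parameter spaces and
    f1[theta][x1] == c * f2[theta][x2] for every theta, with c > 0.
    """
    if set(f1) != set(f2):
        return None
    ratios = set()
    for theta in f1:
        a, b = f1[theta][x1], f2[theta][x2]
        if a == 0 and b == 0:
            continue  # proportionality holds at this theta for any c
        if a == 0 or b == 0:
            return None
        ratios.add(a / b)
    # A likelihood that vanishes for every theta is treated as unrelated here.
    return ratios.pop() if len(ratios) == 1 else None


# Hypothetical models with parameter space {1, 2}.
f1 = {1: {"a": F(1, 4), "b": F(1, 2), "c": F(1, 4)},
      2: {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}}
f2 = {1: {0: F(1, 8), 1: F(7, 8)},
      2: {0: F(1, 4), 1: F(3, 4)}}

print(likelihood_related(f1, "a", f2, 0))  # 2: proportional likelihoods
print(likelihood_related(f1, "b", f2, 0))  # None: the ratio changes with theta
```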

Actually the likelihood principle does not completely express the idea that
two inference bases with the same likelihood function contain the same amount
of statistical information. For this we need another statistical relation. We
define the invariance relation G by $(I_1, I_2) \in G$ whenever there exist 1-1, onto,
smooth functions $g : \mathcal{X}_{E_1} \to \mathcal{X}_{E_2}$, $h : \Theta_{E_1} \to \Theta_{E_2}$ with $g(x_1) = x_2$ and
such that $f_{E_1,\theta}(x) = f_{E_2,h(\theta)}(g(x)) J_g^{-1}(x)$ for every $x \in \mathcal{X}_{E_1}$ where $J_g(x) =
(\det(dg(x)/dx))^{-1} = 1$ in the discrete case. We have the following result.

Lemma 3. G is a statistical principle.

Now consider the equivalence relation $\overline{L \cup G}$. If $(I_1, I_2) \in L$ and $(I_2, I_3) \in G$,
then, for some constant c > 0 and mappings g and h, $f_{E_1,\theta}(x_1) = c f_{E_2,\theta}(x_2) =
c f_{E_3,h(\theta)}(g(x_2)) J_g^{-1}(x_2) = c' f_{E_3,h(\theta)}(g(x_2))$ and so, after relabelling, $I_1$ and $I_3$
have proportional likelihoods. Similarly, if $(I_1, I_2) \in G$ and $(I_2, I_3) \in L$, then
again, after relabelling, $I_1$ and $I_3$ have proportional likelihoods. So $(I_1, I_2) \in
\overline{L \cup G}$ just expresses the fact that $I_1$ and $I_2$ have proportional likelihoods,
perhaps after relabelling the data and the parameter. In this case we can state
clearly what the equivalence relation $\overline{L \cup G}$ expresses and the generated
equivalence relation makes sense. We do not need $\overline{L \cup G}$, however, for a discussion
of Birnbaum's result.

The sufficiency relation S is defined by $(I_1, I_2) \in S$ whenever $\Theta_{E_1} = \Theta_{E_2}$
and there exist minimal sufficient statistics $m_1$ for $E_1$ and $m_2$ for $E_2$ such that
the marginal models induced by the $m_i$ are the same and $m_1(x_1) = m_2(x_2)$. We
have the following result.

Lemma 4. S is a statistical principle and $S \subset L$.
Proof: Clearly S is reflexive and symmetric and $S \subset L$. Suppose $(I_1, I_2) \in S$
via the minimal sufficient statistics $m_1$ and $m_2$ and $(I_2, I_3) \in S$ via the minimal
sufficient statistics $m_2'$ and $m_3$. Since any two minimal sufficient statistics are
1-1 functions of each other, there exists a 1-1 function h such that $m_2' = h \circ m_2$.
Then $(I_1, I_3) \in S$ via the minimal sufficient statistics $h \circ m_1$ and $m_3$.
Obviously we have the result that $(I_1, I_2) \in S$ whenever $I_2$ can be obtained from
$I_1$ via a sufficient statistic or conversely. Furthermore, it makes sense to extend
S to $\overline{S \cup G}$.

The conditionality relation C is defined by $(I_1, I_2) \in C$ whenever $\Theta_{E_1} = \Theta_{E_2}$,
$x_1 = x_2$ and there exists an ancillary statistic a for $E_1$ such that the conditional
model given $a(x_1)$ is given by $E_2$ or with roles of $I_1$ and $I_2$ reversed. We have
the following result.

Lemma 5. C is reflexive and symmetric but is not transitive and $C \subset L$.
Proof: The reflexivity, symmetry and $C \subset L$ are obvious. The lack of transitivity
follows via a simple example. Consider the model E with $\mathcal{X}_E = \{1, 2\}^2$, $\Theta_E =
\{1, 2\}$ and with $f_{E,\theta}$ given by Table 1. Now note that $U(x_1, x_2) = x_1$ and
$V(x_1, x_2) = x_2$ are both ancillary and the conditional models, when we observe
$(x_1, x_2) = (1, 1)$, are given by Tables 2 and 3.

$(x_1, x_2)$            (1,1)   (1,2)   (2,1)   (2,2)
$f_{E,1}(x_1, x_2)$     1/6     1/6     2/6     2/6
$f_{E,2}(x_1, x_2)$     1/12    3/12    5/12    3/12

Table 1: Unconditional distributions.

$(x_1, x_2)$                     (1,1)   (1,2)   (2,1)   (2,2)
$f_{E,1}(x_1, x_2 \mid U = 1)$   1/2     1/2
$f_{E,2}(x_1, x_2 \mid U = 1)$   1/4     3/4

Table 2: Conditional distributions given U = 1.

$(x_1, x_2)$                     (1,1)   (1,2)   (2,1)   (2,2)
$f_{E,1}(x_1, x_2 \mid V = 1)$   1/3             2/3
$f_{E,2}(x_1, x_2 \mid V = 1)$   1/6             5/6

Table 3: Conditional distributions given V = 1.


The only ancillary for both these conditional models is the trivial ancillary
(the constant map). Therefore, there are no applications of C that lead to
the inference base $I_2$, given by Table 2 with data (1, 1), being related to the
inference base $I_3$, given by Table 3 with data (1, 1). But both of $I_2$ and $I_3$ are
related under C to the inference base $I_1$ given by Table 1 with data (1, 1). This
establishes the result.

Note that even under relabellings, the inference bases $I_2$ and $I_3$ in Lemma 5
are not equivalent.
If we are going to say that $(I_1, I_2) \in C$ means that $I_1$ and $I_2$ contain an
equivalent amount of information under C, then we are forced to expand C to $\bar{C}$
so that it is an equivalence relation. But this implies that the two inference bases
$I_2$ and $I_3$ presented in the proof of Lemma 5 contain an equivalent amount of
information and yet they are not directly related via C. Rather they are related
only because they are conditional models obtained from a supermodel that has
two essentially different maximal ancillaries.
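
The ancillarity claims and the conditional tables in the proof of Lemma 5 can be reproduced with exact arithmetic. The following is a minimal sketch: the numbers come from Table 1, while the helper names `marginal`, `is_ancillary` and `conditional` are assumptions made for illustration.

```python
from fractions import Fraction as F

# Table 1: f[theta][(x1, x2)]
f = {
    1: {(1, 1): F(1, 6), (1, 2): F(1, 6), (2, 1): F(2, 6), (2, 2): F(2, 6)},
    2: {(1, 1): F(1, 12), (1, 2): F(3, 12), (2, 1): F(5, 12), (2, 2): F(3, 12)},
}


def marginal(stat, theta):
    """Distribution of the statistic `stat` under f[theta]."""
    dist = {}
    for x, p in f[theta].items():
        dist[stat(x)] = dist.get(stat(x), 0) + p
    return dist


def is_ancillary(stat):
    """A statistic is ancillary when its distribution is the same for every theta."""
    return marginal(stat, 1) == marginal(stat, 2)


def conditional(stat, value, theta):
    """Conditional distribution of the data given stat(x) == value."""
    total = marginal(stat, theta)[value]
    return {x: p / total for x, p in f[theta].items() if stat(x) == value}


def U(x):
    return x[0]


def V(x):
    return x[1]


print(is_ancillary(U), is_ancillary(V))  # True True
for theta in (1, 2):
    print(theta, conditional(U, 1, theta))  # rows of Table 2
    print(theta, conditional(V, 1, theta))  # rows of Table 3
```

The two conditional models share only the observed point (1, 1) and assign it different probabilities for each value of the parameter, which is the sense in which the two maximal ancillaries are essentially different.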

Saying that such models contain an equivalent amount of statistical information
is clearly a substantial generalization of C. Note that, for the example in
the proof of Lemma 5, when (1, 1) is observed, the MLE is $\hat{\theta}(1, 1) = 1$. To measure
the accuracy of this estimate we can compute the conditional probabilities
based on the two inference bases, namely,

$$P_1(\hat{\theta}(x_1, x_2) = 1 \mid U = 1) = 1/2, \quad P_2(\hat{\theta}(x_1, x_2) = 2 \mid U = 1) = 3/4$$
$$P_1(\hat{\theta}(x_1, x_2) = 1 \mid V = 1) = 1/3, \quad P_2(\hat{\theta}(x_1, x_2) = 2 \mid V = 1) = 5/6$$

and so the accuracy of $\hat{\theta}$ is quite different depending on whether we use $I_2$ or $I_3$.
It seems unlikely that we would interpret these inference bases as containing an
equivalent amount of information in a frequentist formulation of statistics. As
noted in Section 2, there is no reason why we have to accept the equivalences
given by a generated equivalence relation unless we are certain that this
equivalence relation expresses the essence of the basic relation. It seems clear that
there is a problem with the assertion that $(I_1, I_2) \in C$ means that $I_1$ and $I_2$
contain an equivalent amount of information without further justification.
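
The four conditional probabilities above follow directly from Table 1. A short sketch of the computation, with `mle` and `accuracy` as assumed helper names:

```python
from fractions import Fraction as F

# Table 1: f[theta][(x1, x2)]
f = {
    1: {(1, 1): F(1, 6), (1, 2): F(1, 6), (2, 1): F(2, 6), (2, 2): F(2, 6)},
    2: {(1, 1): F(1, 12), (1, 2): F(3, 12), (2, 1): F(5, 12), (2, 2): F(3, 12)},
}


def mle(x):
    """Parameter value maximizing the likelihood at x (no ties occur in Table 1)."""
    return max(f, key=lambda theta: f[theta][x])


def accuracy(theta, stat, value):
    """P_theta(MLE equals theta | stat == value)."""
    cond = [x for x in f[theta] if stat(x) == value]
    total = sum(f[theta][x] for x in cond)
    hit = sum(f[theta][x] for x in cond if mle(x) == theta)
    return hit / total


def U(x):
    return x[0]


def V(x):
    return x[1]


print(mle((1, 1)))                              # 1
print(accuracy(1, U, 1), accuracy(2, U, 1))     # 1/2 3/4
print(accuracy(1, V, 1), accuracy(2, V, 1))     # 1/3 5/6
```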

We now follow a development similar to that found in Evans, Fraser and
Monette (1986) to prove the following result.

Theorem 6. $C \subset \bar{C} = L$ where the first containment is proper.
Proof: Clearly $C \subset \bar{C}$ and this containment is proper by Lemma 5. If $(I_1, I_2) \in
\bar{C}$, then (1) implies $(I_1, I_2) \in L$ since $C \subset L$ and so $\bar{C} \subset L$. Now suppose that
$(I_1, I_2) \in L$. We have that $f_{E_1,\theta}(x_1) = c f_{E_2,\theta}(x_2)$ for every $\theta$ for some c > 0.
Assume first that c > 1. Now construct a new inference base $I_1^* = (E_1^*, (1, x_1))$
where $\mathcal{X}_{E_1^*} = \{0, 1\} \times \mathcal{X}_{E_1}$, and $\{f_{E_1^*,\theta} : \theta \in \Theta_{E_1}\}$ is given by Table 4 where
$x_{10}, x_{100}, \ldots$ are the elements of $\mathcal{X}_{E_1}$ not equal to $x_1$ and $p \in [0, 1)$ satisfies
$p/(1 - p) = 1/c$. Then we see that $U(i, x) = i$ is ancillary as is V given by
$V(i, x) = 1$ when $x = x_1$ and $V(i, x) = 0$ otherwise.

           $x_1$                             $x_{10}$                     $x_{100}$                     $\cdots$
$i = 1$    $p f_{E_1,\theta}(x_1)$           $p f_{E_1,\theta}(x_{10})$   $p f_{E_1,\theta}(x_{100})$   $\cdots$
$i = 0$    $1 - p - p f_{E_1,\theta}(x_1)$   $p f_{E_1,\theta}(x_1)$      $0$                           $\cdots$

Table 4: The model $E_1^*$.

Conditioning on $U(i, x) = 1$
gives that $(I_1^*, I_1) \in C$ while conditioning on $V(i, x) = 1$ gives that $(I_1^*, I) \in C$
where $I = ((\{0, 1\}, \{p_\theta : \theta \in \Theta_{E_1}\}), 1)$ and $p_\theta$ is the Bernoulli($f_{E_1,\theta}(x_1)/c$)
probability function. Now, using $I_2$ we construct $I_2^*$ by replacing p by 1/2 and
$f_{E_1,\theta}(x_1)$ by $f_{E_2,\theta}(x_2)$ in Table 4 and obtain that $(I_2^*, I) \in C$ since $f_{E_1,\theta}(x_1)/c =
f_{E_2,\theta}(x_2)$. Using (1) we have that $(I_1, I_2) \in \bar{C}$. If c < 1 we start the construction
process with $I_2$ instead. This proves that $\bar{C} = L$.
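
The construction in the proof of Theorem 6 can be traced on a small instance. The sketch below rests on assumptions: the models `f1` and `f2` are hypothetical, chosen so that the likelihoods at the observed points are proportional with c = 2, and `star_model` fills Table 4 by placing the mass $p f_{E_1,\theta}(x_1)$ of row i = 0 on the first element other than the observed one.

```python
from fractions import Fraction as F


def star_model(f, x_obs, p):
    """Table 4: densities on {0, 1} x X built from model f and observed x_obs."""
    model = {}
    for theta, dens in f.items():
        others = [x for x in dens if x != x_obs]
        table = {(1, x): p * q for x, q in dens.items()}
        table.update({(0, x): F(0) for x in dens})
        table[(0, x_obs)] = 1 - p - p * dens[x_obs]
        table[(0, others[0])] = p * dens[x_obs]
        model[theta] = table
    return model


def conditional(model, stat, value):
    """Conditional model given stat == value.

    Raises if the event does not have the same probability for every theta.
    """
    masses = {th: sum(q for y, q in t.items() if stat(y) == value)
              for th, t in model.items()}
    assert len(set(masses.values())) == 1, "event probability depends on theta"
    return {th: {y: q / masses[th] for y, q in t.items() if stat(y) == value}
            for th, t in model.items()}


def by_first_coordinate(cond):
    """Relabel a conditional model on {(0, x), (1, x)} by its first coordinate."""
    return {th: {i: q for (i, _), q in t.items()} for th, t in cond.items()}


# Hypothetical models with f1[theta]["a"] == 2 * f2[theta][0], so c = 2.
f1 = {1: {"a": F(1, 4), "b": F(1, 2), "c": F(1, 4)},
      2: {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}}
f2 = {1: {0: F(1, 8), 1: F(7, 8)},
      2: {0: F(1, 4), 1: F(3, 4)}}
c = F(2)
p = 1 / (1 + c)  # solves p / (1 - p) = 1 / c

E1_star = star_model(f1, "a", p)
E2_star = star_model(f2, 0, F(1, 2))

given_U = conditional(E1_star, lambda y: y[0], 1)  # recovers E_1
from_1 = by_first_coordinate(conditional(E1_star, lambda y: int(y[1] == "a"), 1))
from_2 = by_first_coordinate(conditional(E2_star, lambda y: int(y[1] == 0), 1))

print(from_1 == from_2)  # True: both give the Bernoulli(f1[theta]["a"] / c) model
```

In this instance the chain of conditionings links the two inference bases through a common Bernoulli model, which is the route by which (1) places them in the generated relation.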

The proof that $L \subset \bar{C}$ relies on discreteness. This was weakened in Evans,
Fraser and Monette (1986) and even further weakened in Jang (2011).

We now show that Birnbaum's proof actually establishes the following result. 

Theorem 7. $S \cup C \subset L \subset \overline{S \cup C}$.

Proof: The first containment is obvious. For the second suppose that $(I_1, I_2) \in
L$. We construct a new inference base $I = (E, y)$ from $I_1$ and $I_2$ as follows. Let
E be given by $\mathcal{X}_E = (1, \mathcal{X}_{E_1}) \cup (2, \mathcal{X}_{E_2})$,

$$f_{E,\theta}(1, x) = \begin{cases} (1/2) f_{E_1,\theta}(x) & \text{when } x \in \mathcal{X}_{E_1} \\ 0 & \text{otherwise,} \end{cases} \qquad f_{E,\theta}(2, x) = \begin{cases} (1/2) f_{E_2,\theta}(x) & \text{when } x \in \mathcal{X}_{E_2} \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$T(i, x) = \begin{cases} (i, x) & \text{when } x \notin \{x_1, x_2\} \\ \{x_1, x_2\} & \text{otherwise} \end{cases}$$

is sufficient for E and so $((E, (1, x_1)), (E, (2, x_2))) \in S$ by the comment after
Lemma 4. Also, $h(i, x) = i$ is ancillary for E and thus $((E, (1, x_1)), (E_1, x_1)) \in C$
and $((E, (2, x_2)), (E_2, x_2)) \in C$. Then by (1) we have that $((E_1, x_1), (E_2, x_2)) \in
\overline{S \cup C}$ and we are done.

Note that Birnbaum's proof only proves the containments with no equalities but
we have the following result.

Theorem 8. $S \cup C$ is properly contained in L while $L = \overline{S \cup C}$.
Proof: To show that $S \cup C \subset L$ is proper, suppose that $E_1$ is the Bernoulli($\theta$), $\theta \in
(0, 1]$ model, $E_2$ is the Geometric($\theta$), $\theta \in (0, 1]$ model and we observe $x_1 = 1$ and
$x_2 = 0$ so $f_{E_1,\theta}(1) = \theta = f_{E_2,\theta}(0)$. Note that the full data is minimal sufficient
for both $E_1$ and $E_2$ with $\mathcal{X}_{E_1} = \{0, 1\}$, $\mathcal{X}_{E_2} = \{0, 1, 2, \ldots\}$ and further that
both of these models have only trivial ancillaries. Therefore, if $I_i = (E_i, x_i)$ we
have that $(I_1, I_2) \notin S$, $(I_1, I_2) \notin C$ but $(I_1, I_2) \in L$ which proves that $S \cup C$ is
properly contained in L.

To prove that the second containment is exact we have, using (1), that
$(I_1, I_2) \in \overline{S \cup C}$ implies that $I_1$ and $I_2$ give rise to proportional likelihoods as
this is true for each element of $S \cup C$ and so $\overline{S \cup C} \subset L$.

So we do not have, as usually stated for Birnbaum's theorem, that S and
C are together equivalent to L but we do have that $\overline{S \cup C}$ is equivalent to L.
Acceptance of $\overline{S \cup C}$ is not entailed, however, by acceptance of both S and C
as we have to examine the additional relationships added to $S \cup C$ to see if
they make sense. If one wishes to say that acceptance of S and C implies the
acceptance of $\overline{S \cup C}$, then a compelling argument is required for these additions
and this seems unlikely. From the example of the proof of Theorem 8 we can
see that acceptance of $\overline{S \cup C}$ is indeed equivalent to acceptance of L.
From Theorems 6 and 7 we have the following corollary.

Corollary 9. $S \cup C \subset \bar{C} = L$ where the first containment is proper. Furthermore,
$S \subset \bar{C}$ and this containment is proper.

A direct proof that $S \subset \bar{C}$ has been derived by Jang (2011). It is interesting to
note that Corollary 9 shows that the existence of S in the modified statement 
of Birnbaum's theorem, where we require that we accept all the equivalences 
generated by S and C, is irrelevant as it is not required. This is a reassuring 
result as it is unlikely that S is defective but it is almost certain that C is 
defective, at least as currently stated. Also we have the following result. 

Lemma 10. $\overline{C \cup G} = \overline{L \cup G}$.

Proof: This is immediate from Lemma 1 and Lemma 3. 

This says that the equivalences obtained by combining invariance under rela-
belling with conditionality are the same as the equivalences obtained by com- 
bining invariance under relabelling with likelihood. 

As with the proof of Birnbaum's theorem, the proof that C = L provided
in Evans, Fraser and Monette (1986) is really a proof that $\bar{C} = L$. This can
be seen from the proof of Theorem 6. So accepting the relation C is not really
equivalent to accepting L unless we agree that the additional elements of $\bar{C}$
make sense. This is essentially equivalent to saying that it doesn't matter which
maximal ancillary we condition on. It is unlikely that this is acceptable to
most frequentist statisticians, as illustrated by the discussion concerning
the example in Lemma 5.

As noted in Durbin (1970), requiring that any ancillaries used in an application
of C be functions of a minimal sufficient statistic voids Birnbaum's
proof, as the ancillary statistic used in the proof of Theorem 7 is not a function
of the sufficient statistic used in the proof. It is not clear, however, what
this restriction does to the result $\bar{C} = L$, but we note that there are situations
where there exist nonunique maximal ancillaries which are functions of
the minimal sufficient statistic. In these circumstances we would still be forced
to conclude the equivalence of inference bases derived by conditioning on the
different maximal ancillaries if we reasoned as in Evans, Fraser and Monette
(1986). Of course, we are arguing here that the result requires the statement of
an additional hypothesis.
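
Durbin's observation about the proof of Theorem 7 can be seen on a small instance of the mixture experiment. This is a sketch under assumptions: the models `f1` and `f2` and the observed points are hypothetical, and the statistic merging the two observed points is written here as `T`, with the merged value labelled `"merged"`.

```python
from fractions import Fraction as F

# Hypothetical experiments with f1[theta]["a"] == 2 * f2[theta][0] for every theta.
f1 = {1: {"a": F(1, 4), "b": F(1, 2), "c": F(1, 4)},
      2: {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}}
f2 = {1: {0: F(1, 8), 1: F(7, 8)},
      2: {0: F(1, 4), 1: F(3, 4)}}
x1, x2 = "a", 0

# Mixture experiment E: E_1 or E_2 is selected with probability 1/2 each.
fE = {theta: {**{(1, x): q / 2 for x, q in f1[theta].items()},
              **{(2, x): q / 2 for x, q in f2[theta].items()}}
      for theta in f1}


def T(y):
    """Merges the observed points (1, x1) and (2, x2); identity elsewhere."""
    return "merged" if y in {(1, x1), (2, x2)} else y


def h(y):
    """Indicator of which experiment was performed."""
    return y[0]


def dist(stat, theta):
    out = {}
    for y, q in fE[theta].items():
        out[stat(y)] = out.get(stat(y), 0) + q
    return out


def cond_given(stat, theta):
    """P_theta(y | stat(y)) for every sample point y."""
    d = dist(stat, theta)
    return {y: q / d[stat(y)] for y, q in fE[theta].items()}


print(cond_given(T, 1) == cond_given(T, 2))   # True: T is sufficient
print(dist(h, 1) == dist(h, 2))               # True: h is ancillary
print(T((1, x1)) == T((2, x2)), h((1, x1)) == h((2, x2)))  # True False
```

The last line shows that T takes the same value at the two observed points while h does not, so h cannot be written as a function of T in this instance.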

4 Conclusions 

We have shown that the proof in Birnbaum (1962) did not prove that S and C
lead to L. Rather the proof establishes that $\overline{S \cup C} = L$ and this is something
quite different. The statement of Birnbaum's theorem in prose should have
been: if we accept the relation S and we accept the relation C and we accept
all the equivalences generated by S and C together, then this is equivalent to
accepting L. The essential flaw in Birnbaum's theorem lies in excluding this last
hypothesis from the statement of the theorem. The same qualification applies
to the result proved in Evans, Fraser and Monette (1986) where the statement
of the theorem should have been: if we accept the relation C and we accept all
the equivalences generated by C, then this is equivalent to accepting L.

The way out of the difficulties posed by Birnbaum's theorem, and the result 
relating C and L, is to acknowledge that additional hypotheses are required for
the results to hold. Certainly these results seem to lose their impact when they 
are correctly stated and we realize that an equivalence relation generated by a 
relation is not necessarily meaningful. It is necessary to provide an argument as 
to why the generated equivalence relation captures the essence of the relation 
that generates it and it is not at all clear how to do this in these cases. 

As we have noted, the essential result in all of this is $\bar{C} = L$ and this has
some content albeit somewhat minor. Furthermore, the proof of this result is 
based on a defect in C, namely, it is not an equivalence relation due to the 
general nonexistence of unique maximal ancillaries. As such it is hard to accept 
C as stated as any kind of characterization of statistical evidence. Given the 
intuitive appeal of this relation in some simple examples, however, resolving 
the difficulties with C still poses a major challenge for a frequentist theory of
statistics.